Abstract: In this paper, we propose a novel family of windowing
techniques to compute Mel frequency cepstral coefficients (MFCC)
for automatic speaker recognition from speech. The proposed
method is based on a fundamental property of the discrete time
Fourier transform (DTFT) related to differentiation in the
frequency domain. A classical windowing scheme such as the
Hamming window is modified to obtain derivatives of the discrete
time Fourier transform coefficients. It is shown mathematically
that the slope and phase of the power spectrum are inherently
incorporated in the newly computed cepstrum. Speaker recognition
systems based on the proposed family of window functions are
shown to attain substantial and consistent performance
improvement over the baseline single tapered Hamming window as
well as the recently proposed multitaper windowing technique.

Index Terms: Differentiation in frequency, power spectrum
estimation, speaker recognition, tapered window, Mel-frequency
cepstral coefficients (MFCC).

I. Introduction 

Mel frequency cepstral coefficient (MFCC) extraction schemes use
the discrete Fourier transform (DFT) to calculate the short-term
power spectrum of the speech signal. During this process, a
Hamming or Hanning window is applied to the raw speech frames in
order to reduce spectral leakage. These windows have reasonable
sidelobe and mainlobe characteristics, which are required for DFT
computation. However, there exist various other window functions
which also behave well in terms of certain parameters of their
frequency responses [1]. In practice, selecting the optimal
window function for a speech processing application is still an
open challenge [2]. Recently, alternatives to the Hamming window
have drawn the attention of researchers [3], [4]. For example,
speaker recognition systems based on MFCC extracted using
multitaper window functions are shown to be more robust than the
existing single tapered Hamming window based approach [5].

In this work, we propose a simple time domain processing of
speech after it is multiplied with a standard window. The
processing is based on the well-known differentiation in
frequency property of the discrete time Fourier transform [6],
and it can be easily integrated with the standard window during
DFT computation. Due to the proposed modification, we inherently
compute the derivative of the Fourier transform. The power
spectrum is then computed from those differentiated Fourier
coefficients. There is evidence that speaker discriminating
attributes are present in the slope of the power spectrum [7] as
well as in phase information [8]. In this paper, we show
mathematically that the proposed technique integrates both slope
and phase information with the magnitude spectrum. Therefore, it
can be hypothesized that speech features extracted from these
modified Fourier coefficients will give better recognition
performance. We have evaluated the performance on multiple
databases for the speaker verification (SV) task, and consistent
performance improvement is achieved over the Hamming window
based baseline system.

The rest of the paper is organized as follows. In Section II, we
describe the proposed windowing scheme and its features. In
addition, the effect of the newly introduced window on power
spectrum computation is mathematically analyzed. Experimental
results are shown in Section III. Finally, the paper is concluded
in Section IV.

II. Proposed Windowing Method 

A. Design of proposed window function 

Let $x(n)$ be a windowed speech frame of length $N$ whose DTFT is
$X(e^{j\omega})$. We know from the differentiation in frequency
property [6] that the DTFT of $n x(n)$ can be written as

$$\tilde{X}(e^{j\omega}) = j \frac{dX(e^{j\omega})}{d\omega}. \tag{1}$$

As the DFT coefficients $X(k)$ are samples of the DTFT at
$\omega = \frac{2\pi k}{N}$, the DFT of $n x(n)$ gives discrete
samples of $\tilde{X}(e^{j\omega})$ at $\omega = \frac{2\pi k}{N}$.
Therefore, $\tilde{X}(k) = \tilde{X}(e^{j\omega})|_{\omega = 2\pi k/N}$
are the DFT coefficients of $n x(n)$.

Since $x(n)$ is a windowed speech frame, it can be represented as
$w(n)s(n)$, where $s(n)$ is the raw speech frame and $w(n)$ is the
window function. We propose a new window function
$\tilde{w}(n) = n w(n)$. The windowed speech frame is then
represented as $\tilde{x}(n) = \tilde{w}(n)s(n)$.

Generalizing the differentiation in frequency property, we can
write that, for an integer $r$, the DTFT of $n^r x(n)$ is
$j^r \frac{d^r X(e^{j\omega})}{d\omega^r}$. Therefore, the proposed
window function of order $r$ can be written as $n^r w(n)$. The
standard Hamming window can be viewed as the zero order window of
the proposed family. The window functions are shown in Fig. 1 for
first and second order along with the Hamming window. Note that,
in contrast to frequently used window functions, the newly
introduced family of window functions is asymmetric and
non-tapered.
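
As an illustration of the construction above, the following NumPy sketch builds the window $n^r w(n)$ on top of a Hamming window and numerically checks that the DFT of $n x(n)$ matches $j\,dX/d\omega$ sampled at the DFT frequencies. The function names `proposed_window` and `dtft`, the random stand-in for a speech frame, and the peak normalization are assumptions made for this sketch only.

```python
import numpy as np

def proposed_window(N, r=1):
    """r-th order window n**r * w(n) built on a Hamming window w(n).

    The peak is scaled to one; this scaling is a choice made for this
    sketch (it mirrors the normalization used for plotting in Fig. 1).
    """
    n = np.arange(N)
    w = np.hamming(N) * n.astype(float) ** r
    return w / np.max(np.abs(w))

def dtft(x, omega):
    """Directly evaluate X(e^{j omega}) = sum_n x(n) e^{-j omega n}."""
    n = np.arange(len(x))
    return np.exp(-1j * np.outer(omega, n)) @ x

rng = np.random.default_rng(0)
N = 160
n = np.arange(N)
s = rng.standard_normal(N)        # stand-in for a raw speech frame
x = np.hamming(N) * s             # Hamming-windowed frame x(n)

omega = 2 * np.pi * np.arange(N) / N      # DFT frequencies
h = 1e-6
# Central-difference estimate of dX/d(omega) at the DFT frequencies.
dX = (dtft(x, omega + h) - dtft(x, omega - h)) / (2 * h)
X_tilde = np.fft.fft(n * x)               # DFT of n x(n)
max_err = np.max(np.abs(X_tilde - 1j * dX))
print("largest deviation from j dX/d(omega):", max_err)
```

The check only exercises the first order case; higher orders follow by applying the same property repeatedly.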

B. Characteristics of the proposed window function 

Commonly, the effectiveness of a window function is judged by
different performance metrics [1]. In order to evaluate the
performance of the window in DFT computation, various performance
metrics are computed prior to the application of this window
function in speech feature extraction. We have calculated three
widely used performance evaluation metrics: spectral leakage
factor, relative sidelobe attenuation, and mainlobe width (-3 dB)
of the Hamming and proposed windows of different orders. The
results are shown in Table I for a window size of 160 samples. It
can be observed that with increasing order, the spectral leakage
increases and the sidelobe attenuation decreases to some extent,
which has a minor effect on recognition performance. However, the
considerable increase in mainlobe width helps to estimate a
smooth power spectrum, and that is expected to improve
recognition performance [9].

Fig. 1. Comparison of Hamming window (black) with first (blue)
and second (red) order differentiation based window in (a) time
domain and (b) frequency domain for a window of size 160 samples.
Amplitudes of all the window functions are normalized to one for
visual clarity.
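
The mainlobe widening discussed in this subsection can be approximated with a zero-padded FFT. The sketch below is only illustrative: the helper `mainlobe_width_3db`, the FFT size, and the convention of reporting the two-sided -3 dB width in units of $\pi$ rad/sample are assumptions, so the absolute values need not coincide with those reported in Table I.

```python
import numpy as np

def mainlobe_width_3db(w, nfft=1 << 18):
    """Two-sided -3 dB mainlobe width in units of pi rad/sample.

    Assumes the magnitude response peaks at zero frequency, which
    holds for the non-negative windows considered here.
    """
    mag = np.abs(np.fft.rfft(w, nfft))
    mag = mag / mag[0]
    # First bin where the response falls below 1/sqrt(2) of the peak.
    k = np.argmax(mag < 1 / np.sqrt(2))
    omega_3db = 2 * np.pi * k / nfft       # one-sided, in rad/sample
    return 2 * omega_3db / np.pi

N = 160
n = np.arange(N)
hamming = np.hamming(N)
# r = 0 is the Hamming window; r = 1, 2 are the proposed windows.
widths = {r: mainlobe_width_3db(hamming * n.astype(float) ** r)
          for r in (0, 1, 2)}
for r, width in widths.items():
    print(f"r = {r}: -3 dB mainlobe width = {width:.6f}")
```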

TABLE I

Performance metrics of various window functions. Sequence length
is 160 samples, i.e., 20 ms at a sample rate of 8 kHz.

Window     Leakage Factor   Relative Sidelobe Attenuation   Mainlobe Width (-3 dB)
Hamming    0.04%            -42.6 dB                        0.015625
r = 1      0.06%            -42.6 dB                        0.017578
r = 2      0.17%            -37.9 dB                        0.018555

C. Effect of the proposed window in power spectrum computation

In this subsection, we derive a mathematical connection between
the power spectrum of the speech frame under the proposed window
and the power spectrum of the original Hamming windowed speech
frame.

Let us assume that the power spectrum of the Hamming windowed
signal is $P(\omega)$ and the power spectrum obtained with the
proposed window is $\tilde{P}(\omega)$. Therefore,
$P(\omega) = H^2(\omega) = |X(e^{j\omega})|^2$ and
$\tilde{P}(\omega) = \tilde{H}^2(\omega) = |\tilde{X}(e^{j\omega})|^2$,
where $H(\omega)$ and $\tilde{H}(\omega)$ are the magnitude spectra
of the two signals respectively. Now, since $X(e^{j\omega})$ can be
decomposed into a real part $X_R(\omega)$ and an imaginary part
$X_I(\omega)$, the slope of the magnitude spectrum of the Hamming
windowed speech signal can be written as

$$\frac{dH(\omega)}{d\omega} = \frac{1}{H(\omega)}\left[X_R(\omega)\frac{dX_R(\omega)}{d\omega} + X_I(\omega)\frac{dX_I(\omega)}{d\omega}\right]. \tag{2}$$

On the other hand, the magnitude spectrum of the modified signal
can be written as

$$\tilde{H}(\omega) = \left|j\frac{dX(e^{j\omega})}{d\omega}\right| = \sqrt{\left(\frac{dX_R(\omega)}{d\omega}\right)^2 + \left(\frac{dX_I(\omega)}{d\omega}\right)^2}. \tag{3}$$

Now, if we consider that
$\frac{dX_R(\omega)}{d\omega} = \alpha(\omega)\cos\psi(\omega)$ and
$\frac{dX_I(\omega)}{d\omega} = \alpha(\omega)\sin\psi(\omega)$, then
$\alpha(\omega) = \sqrt{\left(\frac{dX_R(\omega)}{d\omega}\right)^2 + \left(\frac{dX_I(\omega)}{d\omega}\right)^2}$
and
$\psi(\omega) = \tan^{-1}\frac{dX_I(\omega)/d\omega}{dX_R(\omega)/d\omega}$.
Therefore, from Equation (3),

$$\alpha(\omega) = \tilde{H}(\omega). \tag{4}$$

On the other hand, we can put

$$X_R(\omega) = H(\omega)\cos\phi(\omega), \qquad X_I(\omega) = H(\omega)\sin\phi(\omega), \tag{5}$$

where $\phi(\omega) = \tan^{-1}\frac{X_I(\omega)}{X_R(\omega)}$ is the
phase spectrum. Therefore, from Equation (2) and Equation (5), we
get

$$\frac{dH(\omega)}{d\omega} = \alpha(\omega)\cos\big(\psi(\omega) - \phi(\omega)\big) = \tilde{H}(\omega)\cos\big(\psi(\omega) - \phi(\omega)\big). \tag{6}$$

Finally, since $\frac{dP(\omega)}{d\omega} = 2H(\omega)\frac{dH(\omega)}{d\omega}$,
we can write the final expression of the output power spectrum
$\tilde{P}(\omega)$ as

$$\tilde{P}(\omega) = \frac{1}{4P(\omega)}\left(\frac{dP(\omega)}{d\omega}\right)^2 \sec^2\big(\psi(\omega) - \phi(\omega)\big). \tag{7}$$

The term $\frac{dP(\omega)}{d\omega}$ in Equation (7) corresponds to
the slope of the power spectrum of the Hamming windowed speech at
frequency $\omega$. Hence, as a consequence of computing the power
spectrum from the derivative of the Fourier transform, we obtain
a modified power spectrum which is related to the slope of the
original power spectrum. Apart from that, the newly formulated
power spectrum is also related to the phase spectrum of the
signal. Using a more complicated computation, it can also be
shown that the higher order versions of the proposed
differentiation window (e.g., for $r > 1$) will compute a power
spectrum involving higher order derivatives of $P(\omega)$.

The modified DFT magnitude coefficients are nothing but the
samples of $\tilde{H}(\omega)$ at $\omega = \frac{2\pi k}{N}$.
Therefore, mel cepstrum computation using the proposed window
integrates the slope of the power spectrum, the phase, and of
course the power spectrum of the signal. It is expected that the
resulting speech feature will be more effective than the standard
cepstrum, which is solely based on the power spectrum.
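
The relation developed in this subsection can be sanity-checked numerically. As we read the derivation, writing $\phi$ for the phase of $X(e^{j\omega})$, $\psi$ for the phase of $dX/d\omega$, and $\tilde{H}$, $\tilde{P}$ for the magnitude and power spectrum under the first order window, it implies $dH/d\omega = \tilde{H}\cos(\psi - \phi)$ and $\tilde{P} = (dP/d\omega)^2 \sec^2(\psi - \phi) / (4P)$. The sketch below tests both on a random Hamming-windowed frame; the test signal, the frequency grid and the helper `dtft` are assumptions made for illustration.

```python
import numpy as np

def dtft(x, omega):
    """Directly evaluate X(e^{j omega}) = sum_n x(n) e^{-j omega n}."""
    n = np.arange(len(x))
    return np.exp(-1j * np.outer(omega, n)) @ x

rng = np.random.default_rng(1)
N = 160
n = np.arange(N)
x = np.hamming(N) * rng.standard_normal(N)   # Hamming-windowed frame

omega = np.linspace(0.1, 3.0, 50)
X = dtft(x, omega)              # spectrum under the Hamming window
X_t = dtft(n * x, omega)        # spectrum under the first order window
dX = -1j * X_t                  # dX/d(omega), because X_t = j dX/d(omega)

H, H_t = np.abs(X), np.abs(X_t)
phi, psi = np.angle(X), np.angle(dX)

# Slope of the magnitude spectrum, estimated by central differences.
h = 1e-6
dH_num = (np.abs(dtft(x, omega + h)) - np.abs(dtft(x, omega - h))) / (2 * h)
dH_rel = H_t * np.cos(psi - phi)

# Modified power spectrum rebuilt from P, its slope and the two phases.
P = H ** 2
dP = 2 * np.real(np.conj(X) * dX)            # d|X|^2 / d(omega)
P_t_rebuilt = dP ** 2 / (4 * P * np.cos(psi - phi) ** 2)

print(np.max(np.abs(dH_num - dH_rel)))
print(np.max(np.abs(P_t_rebuilt / H_t ** 2 - 1)))
```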

III. Experimental Setup and Results 

A. Speaker Recognition Setup 

1) Database: SV experiments are conducted on multiple large
population NIST corpora to obtain statistically significant
results. We have chosen SRE 2001, SRE 2004, and SRE 2006. The
databases used in the current experiments are briefly described
in Table II.

TABLE II

Database description (core test section) for the performance
evaluation of various window functions.

                  SRE 2001      SRE 2004      SRE 2006
Target Models     74♂, 100♀     246♂, 370♀    354♂, 462♀
Test Segments     2038          1174          3735
Total Trial       22418         26224         51068
True Trial        2038          2386          3616
Impostor Trial    20380         23838         47452

2) Feature Extraction: MFCC features have been extracted for each
type of window function. 38-dimensional feature vectors are
computed using 20 filters linearly spaced on the Mel scale from
speech frames of size 20 ms (with 50%
overlap). Detailed explanation of used MFCC computation 
technique is available in [TJ. 
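
A minimal NumPy sketch of such a front end is given below to show where the proposed window enters the MFCC pipeline. It is not the exact configuration used in the experiments: the Mel formula, the FFT size, the number of retained cepstral coefficients (19, on the assumption that the 38 dimensions are static coefficients plus their deltas), the window scaling and all function names are assumptions made for illustration.

```python
import numpy as np

def mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, nfft, fs):
    """Triangular filters with centres equally spaced on the Mel scale."""
    edges = inv_mel(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    freqs = np.arange(nfft // 2 + 1) * fs / nfft
    fb = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def mfcc(signal, fs=8000, r=1, n_filters=20, n_ceps=19, nfft=256):
    """MFCC using the window n**r * hamming(n); r = 0 is the baseline."""
    frame_len = fs * 20 // 1000           # 20 ms frames
    hop = frame_len // 2                  # 50% overlap
    n = np.arange(frame_len)
    window = np.hamming(frame_len) * n.astype(float) ** r
    window = window / window.max()        # scaling chosen for this sketch
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = n[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * window
    power = np.abs(np.fft.rfft(frames, nfft, axis=1)) ** 2
    fbank = mel_filterbank(n_filters, nfft, fs)
    log_energy = np.log(power @ fbank.T + 1e-12)
    # DCT-II basis; the zeroth cepstral coefficient is dropped.
    k = np.arange(1, n_ceps + 1)[:, None]
    m = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * k * (m + 0.5) / n_filters)
    return log_energy @ dct.T

rng = np.random.default_rng(0)
speech = rng.standard_normal(8000)        # one second of noise as a stand-in
features = mfcc(speech, r=1)
print(features.shape)
```

Switching between the baseline and the proposed windows only changes the `r` argument; every other block stays the same.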

3) Classifier Description: State-of-the-art speaker recognition
systems use a Gaussian mixture model-universal background model
(GMM-UBM) based classifier [10]. The speech data for UBM training
are taken from the development data of SRE 2001 and the training
section of SRE 2003 for the evaluation of SRE 2001 and SRE 2004
respectively. The number of mixtures is set to 256 for these
experiments. Here, gender dependent GMM clusters are initialized
using binary split based vector quantization. The final UBM
parameters are estimated using the EM algorithm. Target models
are created by adapting only the means of the UBM with relevance
factor 14. During score computation, the top-5 Gaussians of the
corresponding background model are considered for each frame.
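
The mean-only adaptation and top-5 scoring described above can be sketched as follows. This is a generic relevance-MAP formulation for diagonal-covariance GMMs on toy two-dimensional data, not the system used in the experiments; the function names, the toy UBM and the data are assumptions made for illustration.

```python
import numpy as np

def log_gauss_diag(X, means, variances):
    """Log density of each frame under each diagonal Gaussian, shape (T, K)."""
    diff = X[:, None, :] - means[None, :, :]
    return -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                   + np.sum(np.log(2 * np.pi * variances), axis=1))

def logsumexp_rows(a):
    m = a.max(axis=1, keepdims=True)
    return m[:, 0] + np.log(np.exp(a - m).sum(axis=1))

def map_adapt_means(X, weights, means, variances, relevance=14.0):
    """Adapt only the UBM means to the enrolment frames X of shape (T, D)."""
    log_p = np.log(weights) + log_gauss_diag(X, means, variances)
    post = np.exp(log_p - logsumexp_rows(log_p)[:, None])   # responsibilities
    n_k = post.sum(axis=0)                                   # soft counts
    e_k = (post.T @ X) / np.maximum(n_k, 1e-10)[:, None]     # data means
    alpha = (n_k / (n_k + relevance))[:, None]               # adaptation weight
    return alpha * e_k + (1.0 - alpha) * means

def llr_top_c(X, weights, ubm_means, target_means, variances, c=5):
    """Average log-likelihood ratio using the top-c UBM Gaussians per frame."""
    ubm = np.log(weights) + log_gauss_diag(X, ubm_means, variances)
    tgt = np.log(weights) + log_gauss_diag(X, target_means, variances)
    top = np.argsort(ubm, axis=1)[:, -c:]
    ubm_top = np.take_along_axis(ubm, top, axis=1)
    tgt_top = np.take_along_axis(tgt, top, axis=1)
    return float(np.mean(logsumexp_rows(tgt_top) - logsumexp_rows(ubm_top)))

# Toy UBM: 8 well separated components in 2-D with unit variances.
ubm_means = np.array([[i, j] for i in (-30, -10, 10, 30) for j in (-10, 10)], float)
variances = np.ones_like(ubm_means)
weights = np.full(len(ubm_means), 1.0 / len(ubm_means))

rng = np.random.default_rng(0)
enrol = ubm_means[0] + 1.0 + rng.standard_normal((500, 2))    # shifted speaker
target_means = map_adapt_means(enrol, weights, ubm_means, variances)

target_test = ubm_means[0] + 1.0 + rng.standard_normal((300, 2))
impostor_test = ubm_means[0] + rng.standard_normal((300, 2))
score_target = llr_top_c(target_test, weights, ubm_means, target_means, variances)
score_impostor = llr_top_c(impostor_test, weights, ubm_means, target_means, variances)
print(score_target, score_impostor)
```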

For the evaluation of SRE 2006, the GMM-UBM system is trained
with gender dependent UBMs of 512 mixtures using the complete
one side training data of SRE 2004 (i.e., 246 male and 370 female
utterances). zt-score normalization is performed on the raw
scores of the GMM-UBM system. The normalization data is obtained
from the one side section of SRE 2004. Experiments are
also conducted using classifiers based on the GMM supervector
and support vector machine (GSV-SVM) [11]. This is based on the
same UBM as the GMM-UBM system. The negative examples for the SVM
are obtained from the same data used for UBM preparation.
Experiments are also carried out with the nuisance attribute
projection (NAP) based channel compensation technique [12].
Channel factors are obtained using the speech signals of SRE
2004. Altogether, 699 utterances from 101 male speakers and 905
utterances from 142 female speakers are used to train the NAP
projection matrix of co-rank 64.

B. Results 

Speaker recognition experiments are carried out with different
window functions while keeping the other blocks identical, i.e.,
pre-processing, feature extraction and classification are exactly
the same for all window based systems. We first evaluate the
performance on SRE 2001 and SRE 2004 with the classical GMM-UBM
system. The performance of the proposed windows (first and second
order) is compared with the single tapered Hamming window as well
as the recently proposed multitaper window. The multitaper
performance has been evaluated with multipeak tapers of size 6
and 12 (denoted by k in Table III), as mentioned in [3], [5]. The
results are shown in Table III and the corresponding detection
error trade-off (DET) plots are shown in Fig. 2(i) and Fig.
2(ii). The equal error rate (EER) and minimum detection cost
function (minDCF) of SV systems based on the newly proposed
window functions are consistently better than those of the
Hamming baseline for both databases. In comparison with the
baseline Hamming window based system, the first and second order
windows give 0.6% and 7.74% relative improvement in EER, and
0.26% and 5.59% relative improvement in minDCF, for SRE 2001. For
SRE 2004, the relative improvements in EER are 1.96% and 4.26%,
and for minDCF these are 1.15% and 3.45%. Interestingly, we have
observed that the multitaper windowing techniques do not give
better performance than the proposed method.
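
For reference, the relative improvements quoted above follow from the EER values in Table III as (baseline - proposed) / baseline. The short sketch below recomputes them; the helper name `relative_improvement` is introduced here for illustration only.

```python
def relative_improvement(baseline, proposed):
    """Relative reduction of an error metric, in percent."""
    return 100.0 * (baseline - proposed) / baseline

# EER (in %) of the GMM-UBM system, copied from Table III.
eer = {
    "SRE 2001": {"Hamming": 8.2434, "First Order": 8.1943, "Second Order": 7.6055},
    "SRE 2004": {"Hamming": 14.9629, "First Order": 14.6694, "Second Order": 14.3255},
}
for corpus, row in eer.items():
    for name in ("First Order", "Second Order"):
        gain = relative_improvement(row["Hamming"], row[name])
        print(f"{corpus}, {name}: {gain:.2f}% relative EER improvement")
```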

In Table IV, the performance is shown for different classifiers
on SRE 2006. In this case as well, we have achieved consistent
performance improvement for the proposed window based SV systems.
The DET plot is shown in Fig. 2(iii) for both the GMM-UBM and
GSV-SVM (with NAP) systems. The curves show that SV systems based
on the proposed window functions are consistently better than the
Hamming window based baseline system. It is also observed that
the second order window based systems perform better than the
first order window based systems.
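
The two figures of merit used in this section can be computed from raw trial scores as sketched below. The cost parameters (C_miss = 10, C_fa = 1, P_target = 0.01) are the values commonly associated with NIST SRE evaluations of this period and are an assumption here, as are the function names and the simple nearest-point EER estimate.

```python
import numpy as np

def error_rates(target_scores, impostor_scores):
    """Miss and false-alarm rates at every candidate threshold.

    A trial is accepted when its score is >= the threshold.
    """
    tgt = np.sort(np.asarray(target_scores, dtype=float))
    imp = np.sort(np.asarray(impostor_scores, dtype=float))
    thresholds = np.sort(np.concatenate([tgt, imp]))
    thresholds = np.append(thresholds, thresholds[-1] + 1.0)  # reject everything
    p_miss = np.searchsorted(tgt, thresholds, side="left") / tgt.size
    p_fa = 1.0 - np.searchsorted(imp, thresholds, side="left") / imp.size
    return p_miss, p_fa

def eer_and_min_dcf(target_scores, impostor_scores,
                    c_miss=10.0, c_fa=1.0, p_target=0.01):
    p_miss, p_fa = error_rates(target_scores, impostor_scores)
    i = np.argmin(np.abs(p_miss - p_fa))        # closest point to P_miss = P_fa
    eer = 0.5 * (p_miss[i] + p_fa[i])
    dcf = c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
    return float(eer), float(dcf.min())

eer_value, min_dcf = eer_and_min_dcf([0.0, 2.0], [1.0, -1.0])
print(f"EER = {100 * eer_value:.2f}%, minDCF x 100 = {100 * min_dcf:.2f}")
```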

IV. Conclusion 

In this paper, we have focused on a class of window functions
with which a more effective speech feature can be computed. The
newly formulated feature represents the power spectrum of the
original signal as well as its derivative. In addition, it
integrates phase information, which is also relevant for speaker
recognition. Speaker recognition systems based on the proposed
windowing schemes are evaluated on different NIST databases. We
have achieved consistent performance improvement over the
baseline Hamming window based technique on various combinations
of classifiers and databases.

TABLE III

SV performance on NIST SRE 2001 and NIST SRE 2004 using various
window functions for GMM-UBM based system.

                        NIST SRE 2001                NIST SRE 2004
Window Type             EER (in %)   minDCF x 100    EER (in %)   minDCF x 100
Hamming                 8.2434       3.5763          14.9629      6.3231
Multitaper (k = 6)      8.0471       3.5778          15.2501      6.4363
Multitaper (k = 12)     10.9372      4.6606          18.0196      7.2271
First Order             8.1943       3.5672          14.6694      6.2501
Second Order            7.6055       3.3763          14.3255      6.1050


TABLE IV

SV performance on NIST SRE 2006 using various systems for
different window functions.

                        GMM-UBM                      GSV-SVM                      GSV-SVM (with NAP)
Window Type             EER (in %)   minDCF x 100    EER (in %)   minDCF x 100    EER (in %)   minDCF x 100
Hamming                 11.4493      4.4702          8.8471       4.0330          6.6419       3.1161
Multitaper (k = 6)      11.6981      4.5493          9.0705       4.2211          6.8886       3.2725
Multitaper (k = 12)     14.2971      5.2299          11.430       5.0286          8.2416       3.9699
Proposed First Order    10.9856      4.3521          8.3792       4.0233          6.2503       3.0961
Proposed Second Order   10.7559      4.2627          8.3242       3.9555          6.1359       3.0646

Fig. 2. DET plots of different window based systems (Black:
Hamming, Blue: first order, Red: second order) are shown for (i)
SRE 2001, (ii) SRE 2004, (iii) SRE 2006. In subfigure (iii), the
dotted lines show results for the GSV-SVM system with NAP.